Computing communities in large networks using random walks 



Pascal Pons and Matthieu Latapy 

LIAFA - CNRS and University Paris 7 - 2 place Jussieu, 75251 Paris Cedex 05, France 

(pons, latapy)@liafa.jussieu.fr 

Abstract 

Dense subgraphs of sparse graphs (communities), which appear in most real-world complex
networks, play an important role in many contexts. Computing them, however, is generally
expensive. We propose here a measure of similarity between vertices based on random walks which
has several important advantages: it captures well the community structure in a network, it can
be computed efficiently, and it can be used in an agglomerative algorithm to compute efficiently
the community structure of a network. We propose such an algorithm, called Walktrap, which
runs in time $O(mn^2)$ and space $O(n^2)$ in the worst case, and in time $O(n^2 \log n)$ and space $O(n^2)$
in most real-world cases ($n$ and $m$ are respectively the number of vertices and edges in the input
graph). Extensive comparison tests show that our algorithm surpasses previously proposed ones
concerning the quality of the obtained community structures and that it stands among the best
ones concerning the running time.

Keywords: complex networks, graph theory, community structure, random walks. 

1 Introduction 

Recent advances have brought out the importance of complex networks in many differ- 
ent domains such as sociology (acquaintance networks, collaboration networks), biology 
(metabolic networks, gene networks) or computer science (internet topology, web graph, 
p2p networks). We refer to |45ll42l [Tl l81lll2| for reviews from different perspectives and for 
an extensive bibliography. The associated graphs are in general globally sparse but locally 
dense: there exist groups of vertices, called communities, highly connected between them 
but with few links to other vertices. This kind of structure brings out much information 
about the network. For example, in a metabolic network the communities correspond 
to biological functions of the cell |3H1. In the web graph the communities correspond to 
topics of interest [^[TH] . 

This notion of community is however difficult to define formally. Many definitions
have been proposed in social network studies [45], but they are too restrictive or cannot
be computed efficiently. However, most recent approaches have reached a consensus,
and consider that a partition $\mathcal{P} = \{C_1, \dots, C_k\}$ of the vertices of a graph $G = (V, E)$
($\forall i, C_i \subseteq V$) represents a good community structure if the proportion of edges inside
the $C_i$ (internal edges) is high compared to the proportion of edges between them (see
for example the definitions given in [19]). Therefore, we will design an algorithm which
finds communities satisfying this criterion. More precisely, we will evaluate the quality
of a partition into communities using a quantity (known as modularity [32, 33]) which
captures this.

We will consider throughout this paper an undirected graph $G = (V, E)$ with $n = |V|$
vertices and $m = |E|$ edges. We impose that each vertex is linked to itself by a loop (we
add these loops if necessary). We also suppose that $G$ is connected, the case where it is
not being treated by considering the components as different graphs.

1.1 Our approach and results 

Our approach is based on the following intuition: random walks on a graph tend to get
"trapped" into densely connected parts corresponding to communities. We therefore begin
with some properties of random walks on graphs. Using them, we define a measurement
of the structural similarity between vertices and between communities, thus defining a
distance. We relate this distance to existing spectral approaches of the problem. But
our distance has an important advantage over these methods: it is efficiently computable,
and can be used in a hierarchical clustering algorithm (merging iteratively the vertices
into communities). One obtains this way a hierarchical community structure that may
be represented as a tree called dendrogram (an example is provided in Figure 1). We
propose such an algorithm, called Walktrap, which computes a community structure in
time $O(mnH)$ where $H$ is the height of the corresponding dendrogram. The worst case is
$O(mn^2)$. But most real-world complex networks are sparse ($m = O(n)$) and, as already
noticed in [5], $H$ is generally small and tends to the most favourable case in which the
dendrogram is balanced ($H = O(\log n)$). In this case, the complexity is therefore $O(n^2 \log n)$.
We finally evaluate the performance of our algorithm with different experiments which
show that it surpasses previously proposed algorithms in most cases.

1.2 Related work 

Many algorithms to find community structures in graphs exist. Most of them result from 
very recent works, but this topic is related to the classical problem of graph partitioning 
that consists in splitting a graph into a given number of groups while minimizing the cost 
of the edge cut ^lESlEHI- However, these algorithms are not well suited to our case 
because they need the number of communities and their size as parameters. The recent
interest in the domain has started with a new divisive approach proposed by Girvan and
Newman [23, 33]: the edges with the largest betweenness (number of shortest paths passing
through an edge) are removed one by one in order to split hierarchically the graph into
communities. This algorithm runs in time $O(m^2 n)$. Similar algorithms were proposed by
Radicchi et al and by Fortunato et al ^H]- The first one uses a local quantity (the 
number of loops of a given length containing an edge) to choose the edges to remove and 
runs in time $O(m^2)$. The second one uses a more complex notion of information centrality
that gives better results but poor performance, in $O(m^3 n)$.

Hierarchical clustering is another classical approach introduced by sociologists for data 
analysis pi If 5|. From a measurement of the similarity between vertices, an agglomera- 
tive algorithm groups iteratively the vertices into communities (different methods exist, 
depending on the way of choosing the communities to merge at each step). Several
agglomerative methods have been recently introduced, and we will use this kind of method in our approach.
Newman proposed in [32] a greedy algorithm that starts with $n$ communities corresponding
to the vertices and merges communities in order to optimize a function called modularity,
which measures the quality of a partition. This algorithm runs in $O(mn)$ and
has recently been improved to a complexity $O(mH \log n)$ (with our notations) [5]. The
algorithm of Donetti and Muñoz [10] also uses a hierarchical clustering method: they use
the eigenvectors of the Laplacian matrix of the graph to measure the similarities between
vertices. The complexity is determined by the computation of all the eigenvectors, in
$O(n^3)$ time for sparse matrices. Other interesting methods have been proposed, see for
instance gEl 1^ El 13 IH ■ 

Random walks themselves have already been used to infer structural properties of 
networks in some previous works. Gaume |2f| used this notion in linguistic context. Fouss 
et al [2n] used the Euclidean commute time distance based on the average first-passage 
time of walkers. Zhou and Lipowsky [48] introduced another dissimilarity index based on
the same quantity; it has been used in a hierarchical algorithm (called Netwalk). The Markov
Cluster Algorithm [43] iterates two matrix operations (one corresponding to random walks),
bringing out clusters in the limit state. Unfortunately the last three approaches run
in $O(n^3)$ and cannot manage networks with more than a few thousand vertices. Our
approach has the main advantage of being significantly faster while producing very good
results.



2 Preliminaries on random walks 

The graph $G$ is associated to its adjacency matrix $A$: $A_{ij} = 1$ if vertices $i$ and $j$ are
connected and $A_{ij} = 0$ otherwise. The degree $d(i) = \sum_j A_{ij}$ of vertex $i$ is the number of
its neighbors (including itself). As we discussed in the introduction, the graph is assumed
to be connected. To simplify the notations, we only consider unweighted graphs in this
paper. It is however trivial to extend our results to weighted graphs ($A_{ij} \in \mathbb{R}^+$ instead
of $A_{ij} \in \{0, 1\}$), which is an advantage of this approach.

Let us consider a discrete random walk process (or diffusion process) on the graph G 
(see |3UI B] for a complete presentation of the topic). At each time step a walker is on a 
vertex and moves to a vertex chosen randomly and uniformly among its neighbors. The
sequence of visited vertices is a Markov chain, the states of which are the vertices of the
graph. At each step, the transition probability from vertex $i$ to vertex $j$ is $P_{ij} = \frac{A_{ij}}{d(i)}$. This
defines the transition matrix $P$ of the random walk process. One can also write $P = D^{-1}A$
where $D$ is the diagonal matrix of the degrees ($\forall i, D_{ii} = d(i)$ and $D_{ij} = 0$ for $i \neq j$).
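
As a small illustration that is not part of the original text, the following Python sketch builds the transition matrix $P = D^{-1}A$ for a toy graph, adding the loop on every vertex that the paper imposes. The helper names `add_self_loops` and `transition_matrix` and the example graph (two triangles joined by one edge) are assumptions chosen for this sketch only.

```python
def add_self_loops(edges, n):
    """Adjacency sets of an undirected graph, with a loop on every vertex."""
    adj = [{i} for i in range(n)]          # each vertex is linked to itself
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def transition_matrix(adj):
    """Dense P = D^{-1} A as a list of rows: P[i][j] = A_ij / d(i)."""
    n = len(adj)
    return [[(1.0 / len(adj[i]) if j in adj[i] else 0.0) for j in range(n)]
            for i in range(n)]


# Hypothetical example: triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = add_self_loops(edges, 6)
P = transition_matrix(adj)
```

Every row of `P` sums to 1, and vertex 2 (neighbors 0, 1, 3 and itself) spreads its probability in four equal parts.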

The process is driven by the powers of the matrix $P$: the probability of going from
$i$ to $j$ through a random walk of length $t$ is $(P^t)_{ij}$. In the following, we will denote this
probability by $P^t_{ij}$. It satisfies two well known properties of the random walk process
which we will use in the sequel:

Property 1 When the length $t$ of a random walk starting at vertex $i$ tends towards infinity,
the probability of being on a vertex $j$ only depends on the degree of vertex $j$ (and
not on the starting vertex $i$):

$$\forall i, \quad \lim_{t \to +\infty} P^t_{ij} = \frac{d(j)}{\sum_k d(k)}$$

We will provide a proof of this property in the next section. 

Property 2 The probabilities of going from $i$ to $j$ and from $j$ to $i$ through a random walk
of a fixed length $t$ have a ratio that only depends on the degrees $d(i)$ and $d(j)$:

$$\forall i, \forall j, \quad d(i) P^t_{ij} = d(j) P^t_{ji}$$

Proof: This property can be written as the matrix equation $D P^t D^{-1} = (P^t)^T$ (where
$M^T$ is the transpose of the matrix $M$). By using $P = D^{-1}A$ and the symmetry of the
matrices $D$ and $A$, we have: $D P^t D^{-1} = D (D^{-1}A)^t D^{-1} = (A D^{-1})^t = (A^T (D^{-1})^T)^t =
((D^{-1}A)^T)^t = (P^t)^T$. □
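
Both properties can be checked numerically on a toy graph. The sketch below is an illustration only: the function names, the example graph (two triangles joined by one edge, with a loop on each vertex) and the walk lengths 3 and 200 are assumptions, and the dense matrix power is used for clarity rather than efficiency.

```python
def transition_matrix(edges, n):
    """Return (P, degrees) for an undirected graph with a loop on each vertex."""
    adj = [{i} for i in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    deg = [len(a) for a in adj]
    P = [[(1.0 / deg[i] if j in adj[i] else 0.0) for j in range(n)]
         for i in range(n)]
    return P, deg


def matrix_power(P, t):
    """Dense P^t by repeated multiplication (fine for tiny examples)."""
    n = len(P)
    R = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(t):
        R = [[sum(R[i][k] * P[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
    return R


edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
n = 6
P, deg = transition_matrix(edges, n)

# Property 2: d(i) P^t_ij = d(j) P^t_ji, here for t = 3.
P3 = matrix_power(P, 3)
prop2_gap = max(abs(deg[i] * P3[i][j] - deg[j] * P3[j][i])
                for i in range(n) for j in range(n))

# Property 1: for a long walk, P^t_ij approaches d(j) / sum_k d(k).
P_long = matrix_power(P, 200)
total = sum(deg)
prop1_gap = max(abs(P_long[i][j] - deg[j] / total)
                for i in range(n) for j in range(n))
```

Both gaps are at the level of floating-point rounding on this example.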



3 Comparing vertices using short random walks 

3.1 A distance r to measure vertex similarities 

In order to group the vertices into communities, we will now introduce a distance r between 
the vertices that captures the community structure of the graph. This distance must be 
large if the two vertices are in different communities, and on the contrary if they are in 
the same community it must be small. It will be computed from the information given 
by random walks in the graph. 

Let us consider random walks on $G$ of a given length $t$. We will use the information
given by all the probabilities $P^t_{ij}$ to go from $i$ to $j$ in $t$ steps. The length $t$ of the random
walks must be sufficiently long to gather enough information about the topology of the
graph. However $t$ must not be too long, to avoid the effect predicted by Property 1: the
probabilities would only depend on the degree of the vertices. Each probability $P^t_{ij}$ gives
some information about the two vertices $i$ and $j$, but Property 2 says that $P^t_{ij}$ and $P^t_{ji}$
encode exactly the same information. Finally, the information about vertex $i$ encoded
in $P^t$ resides in the $n$ probabilities $(P^t_{ik})_{1 \le k \le n}$, which is nothing but the $i$-th row of the
matrix $P^t$, denoted by $P^t_{i.}$. To compare two vertices $i$ and $j$ using these data, we must
notice that:

• If two vertices $i$ and $j$ are in the same community, the probability $P^t_{ij}$ will surely be
high. But the fact that $P^t_{ij}$ is high does not necessarily imply that $i$ and $j$ are in the
same community.

• The probability $P^t_{ij}$ is influenced by the degree $d(j)$ because the walker has higher
probability to go to high degree vertices.

• Two vertices of a same community tend to "see" all the other vertices in the same
way. Thus if $i$ and $j$ are in the same community, we will probably have $\forall k, P^t_{ik} \simeq P^t_{jk}$.



We can now give the definition of our distance between vertices, which takes into account 
all previous remarks: 

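Definition 1 Let $i$ and $j$ be two vertices in the graph and

$$r_{ij} = \sqrt{\sum_{k=1}^{n} \frac{(P^t_{ik} - P^t_{jk})^2}{d(k)}} = \left\| D^{-1/2} P^t_{i.} - D^{-1/2} P^t_{j.} \right\| \qquad (1)$$

where $\|.\|$ is the Euclidean norm of $\mathbb{R}^n$.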

One can notice that this distance can also be seen as the distance 0] between the 
two probability distributions $P^t_{i.}$ and $P^t_{j.}$. Notice also that the distance depends on $t$
and should be denoted by $r_{ij}(t)$. We will however consider it as implicit to simplify the
notations. 
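
To make the definition concrete, here is a minimal Python sketch (not from the original paper) that evaluates $r_{ij}$ from two probability vectors and the degrees. The function name `walk_distance` is hypothetical. The vectors below were worked out by hand for an assumed toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3), a loop on each vertex, and $t = 3$.

```python
import math


def walk_distance(p_i, p_j, degrees):
    """r_ij = sqrt(sum_k (P^t_ik - P^t_jk)^2 / d(k))."""
    return math.sqrt(sum((a - b) ** 2 / d
                         for a, b, d in zip(p_i, p_j, degrees)))


# Degrees include the loop on each vertex.
degrees = [3, 3, 4, 4, 3, 3]

# Rows of P^3 for vertices 0, 1 and 5 of the toy graph (hand-computed).
p0 = [x / 432 for x in (121, 121, 130, 42, 9, 9)]
p1 = list(p0)                 # vertices 0 and 1 have the same neighborhood
p5 = [x / 432 for x in (9, 9, 42, 130, 121, 121)]

r_01 = walk_distance(p0, p1, degrees)   # same triangle
r_05 = walk_distance(p0, p5, degrees)   # opposite triangles
```

In this example the two vertices of the same triangle are at distance 0, while vertices in opposite triangles are at a distance of about 0.33.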

Now we generalize our distance between vertices to a distance between communities 
in a straightforward way. Let us consider random walks that start from a community: the 
starting vertex is chosen randomly and uniformly among the vertices of the community. 
We define the probability $P^t_{Cj}$ to go from community $C$ to vertex $j$ in $t$ steps:

$$P^t_{Cj} = \frac{1}{|C|} \sum_{i \in C} P^t_{ij}$$

This defines a probability vector $P^t_{C.}$ that allows us to generalize our distance:

Definition 2 Let $C_1, C_2 \subset V$ be two communities. We define the distance $r_{C_1C_2}$ between
these two communities by:

$$r_{C_1C_2} = \left\| D^{-1/2} P^t_{C_1.} - D^{-1/2} P^t_{C_2.} \right\| = \sqrt{\sum_{k=1}^{n} \frac{(P^t_{C_1k} - P^t_{C_2k})^2}{d(k)}}$$

This definition is consistent with the previous one: $r_{ij} = r_{\{i\}\{j\}}$, and we can also define
the distance between a vertex $i$ and a community $C$: $r_{iC} = r_{\{i\}C}$.

3.2 Relation with spectral approaches

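Theorem 1 The distance $r$ is related to the spectral properties of the matrix $P$ by:

$$r_{ij}^2 = \sum_{\alpha=2}^{n} \lambda_\alpha^{2t} \left( v_\alpha(i) - v_\alpha(j) \right)^2$$

where $(\lambda_\alpha)_{1 \le \alpha \le n}$ and $(v_\alpha)_{1 \le \alpha \le n}$ are respectively the eigenvalues and right eigenvectors of
the matrix $P$.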

In order to prove this theorem, we need the following technical lemma:

Lemma 1 The eigenvalues of the matrix $P$ are real and satisfy:

$$1 = \lambda_1 > \lambda_2 \ge \dots \ge \lambda_n > -1$$

Moreover, there exists an orthonormal family of vectors $(s_\alpha)_{1 \le \alpha \le n}$ such that each vector
$v_\alpha = D^{-1/2} s_\alpha$ and $u_\alpha = D^{1/2} s_\alpha$ are respectively a right and a left eigenvector associated to
the eigenvalue $\lambda_\alpha$:

$$\forall \alpha, \quad P v_\alpha = \lambda_\alpha v_\alpha \quad \text{and} \quad P^T u_\alpha = \lambda_\alpha u_\alpha$$

Proof: The matrix $P$ has the same eigenvalues as its similar matrix $S = D^{1/2} P D^{-1/2} =
D^{-1/2} A D^{-1/2}$. The matrix $S$ is real and symmetric, so its eigenvalues $\lambda_\alpha$ are real. $P$ is a
stochastic matrix ($\sum_{j=1}^{n} P_{ij} = 1$), so its largest eigenvalue is $\lambda_1 = 1$. The graph $G$ is
connected and primitive (the gcd of the cycle lengths of $G$ is 1, due to the loops on each
vertex), therefore we can apply the Perron-Frobenius theorem which implies that $P$ has
a unique dominant eigenvalue. Therefore we have: $|\lambda_\alpha| < 1$ for $2 \le \alpha \le n$.

The symmetry of $S$ implies that there also exists an orthonormal family $s_\alpha$ of
eigenvectors of $S$ satisfying $\forall \alpha, \forall \beta, s_\alpha^T s_\beta = \delta_{\alpha\beta}$ (where $\delta_{\alpha\beta} = 1$ if $\alpha = \beta$ and 0 otherwise). We
then directly obtain that the vectors $v_\alpha = D^{-1/2} s_\alpha$ and $u_\alpha = D^{1/2} s_\alpha$ are respectively a right
and a left eigenvector of $P$ satisfying $u_\alpha^T v_\beta = \delta_{\alpha\beta}$. □



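We can now prove Theorem 1 and obtain Property 1 as a corollary:

Proof: Lemma 1 makes it possible to write a spectral decomposition of the matrix $P$:

$$P = \sum_{\alpha=1}^{n} \lambda_\alpha v_\alpha u_\alpha^T, \quad P^t = \sum_{\alpha=1}^{n} \lambda_\alpha^t v_\alpha u_\alpha^T, \quad \text{and so} \quad P^t_{ij} = \sum_{\alpha=1}^{n} \lambda_\alpha^t v_\alpha(i) u_\alpha(j)$$

When $t$ tends towards infinity, all the terms $\alpha \ge 2$ vanish. It is easy to show that the
first right eigenvector $v_1$ is constant. By normalizing we have $\forall i, v_1(i) = \frac{1}{\sqrt{\sum_k d(k)}}$ and
$\forall j, u_1(j) = \frac{d(j)}{\sqrt{\sum_k d(k)}}$. We obtain Property 1:

$$\lim_{t \to +\infty} P^t_{ij} = \lim_{t \to +\infty} \sum_{\alpha=1}^{n} \lambda_\alpha^t v_\alpha(i) u_\alpha(j) = v_1(i) u_1(j) = \frac{d(j)}{\sum_{k=1}^{n} d(k)}$$

Now we obtain the expression of the probability vector $P^t_{i.}$:

$$P^t_{i.} = \sum_{\alpha=1}^{n} \lambda_\alpha^t v_\alpha(i) u_\alpha = D^{1/2} \sum_{\alpha=1}^{n} \lambda_\alpha^t v_\alpha(i) s_\alpha$$

We put this formula into the second expression of $r_{ij}$ given in Equation (1) of Definition 1. Then we
use the Pythagorean theorem with the orthonormal family of vectors $(s_\alpha)_{1 \le \alpha \le n}$, and we
remember that the vector $v_1$ is constant to remove the case $\alpha = 1$ in the sum. Finally we
have:

$$r_{ij}^2 = \left\| \sum_{\alpha=1}^{n} \lambda_\alpha^t \left( v_\alpha(i) - v_\alpha(j) \right) s_\alpha \right\|^2 = \sum_{\alpha=2}^{n} \lambda_\alpha^{2t} \left( v_\alpha(i) - v_\alpha(j) \right)^2$$

□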



This theorem relates random walks on graphs to the many current works that study 
community structure using spectral properties of graphs. For example, [41] notices that
the modular structure of a graph is expressed in the eigenvectors of $P$ (other than $v_1$)
that correspond to the largest positive eigenvalues. If two vertices $i$ and $j$ belong to the
same community then the coordinates $v_\alpha(i)$ and $v_\alpha(j)$ are similar in all these eigenvectors.
Moreover, gUlE^l show in a more general case that when an eigenvalue Xa tends to 1, the 
coordinates of the associated eigenvector Va are constant in the subsets of vertices that 
correspond to communities. A distance similar to ours (but that cannot be computed 
directly with random walks) is also introduced: $d_t^2(i,j) = \sum_{\alpha=2}^{n} \frac{(v_\alpha(i) - v_\alpha(j))^2}{1 - |\lambda_\alpha|^{2t}}$. Finally,
[10] uses the same spectral approach applied to the Laplacian matrix of the graph $L =
D - A$.

All these studies show that the spectral approach takes an important part in the
search for community structure in graphs. However all these approaches have the same
drawback: the eigenvectors need to be explicitly computed (in time $O(n^3)$ for a sparse
matrix). This computation rapidly becomes intractable in practice when the size of the
graph exceeds a few thousand vertices. Our approach is based on the same foundation
but has the advantage of avoiding the expensive computation of the eigenvectors: it only
needs to compute the probabilities $P^t_{ij}$, which can be done efficiently as shown in the
following subsection.



3.3 Computation of the distance r 

Once the two vectors $P^t_{i.}$ and $P^t_{j.}$ are computed, the distance $r_{ij}$ can be computed in
time $O(n)$ using Equation (1) of Definition 1. Notice that given the probability vectors $P^t_{C_1.}$ and $P^t_{C_2.}$,
the distance $r_{C_1C_2}$ is also computed in time $O(n)$.

The probability vectors can be computed once and stored in memory (which uses
$O(n^2)$ memory space) or they can be dynamically computed (which increases the time
complexity) depending on the amount of available memory. We propose an exact method
and an approximate method to compute them.



Exact computation 

Theorem 2 Each probability vector $P^t_{i.}$ can be computed in time $O(tm)$ and space $O(n)$.

Proof: To compute the vector $P^t_{i.}$, we multiply $t$ times the vector $P^0_{i.}$ ($\forall k, P^0_{i.}(k) = \delta_{ik}$)
by the matrix $P$. This direct method is advantageous in our case because the matrix $P$ is
generally sparse (for real-world complex networks), therefore each product is processed in
time $O(m)$. The initialization of $P^0_{i.}$ is done in $O(n)$ and thus each of the $n$ vectors $P^t_{i.}$
is computed in time $O(n + tm) = O(tm)$. □
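
The procedure in this proof can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' implementation: the function name `walk_vector`, the adjacency-list representation and the toy graph (two triangles joined by one edge, with a loop on each vertex) are choices made for this example.

```python
def walk_vector(adj, i, t):
    """Return P^t_{i.} using t sparse vector-matrix products.

    adj[j] lists the neighbors of j (including j itself), so each product
    touches every adjacency entry once, i.e. O(m) work per step.
    """
    vec = [0.0] * len(adj)
    vec[i] = 1.0                          # P^0_{i.}(k) = delta_ik
    for _ in range(t):
        nxt = [0.0] * len(adj)
        for j, mass in enumerate(vec):
            if mass:
                share = mass / len(adj[j])    # P_jk = 1 / d(j) for each neighbor k
                for k in adj[j]:
                    nxt[k] += share
        vec = nxt
    return vec


edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = [[i] for i in range(6)]             # loop on every vertex
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

p0 = walk_vector(adj, 0, 3)
```

On this graph the result can be verified by hand: the row of $P^3$ for vertex 0 is (121, 121, 130, 42, 9, 9)/432.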



Approximated computation 

Theorem 3 Each probability vector $P^t_{i.}$ can be approximated in time $O(Kt)$ and space
$O(K)$ with a relative error $O(\frac{1}{\sqrt{K}})$.

Proof: We compute $K$ random walks of length $t$ starting from vertex $i$. Then we
approximate each probability $P^t_{ik}$ by $\frac{N_{ik}}{K}$ where $N_{ik}$ is the number of walkers that ended
on vertex $k$ during the $K$ random walks. The Central Limit Theorem implies that this
quantity tends toward $P^t_{ik}$ with a speed $O(\frac{1}{\sqrt{K}})$ when $K$ tends toward infinity. Each
random walk computation is done in time $O(t)$ and constant space, hence the overall
computation is done in time $O(Kt)$ and space $O(K)$. □
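
The sampling scheme of this proof can be sketched as follows. The function name `approx_walk_vector`, the seed, the value $K = 20000$ and the toy graph are assumptions made for illustration. For readability the sketch stores a dense list of $n$ counters rather than only the visited end points.

```python
import random


def approx_walk_vector(adj, i, t, K, rng):
    """Estimate P^t_{i.} from K independent random walks of length t."""
    counts = [0] * len(adj)
    for _ in range(K):
        v = i
        for _ in range(t):
            v = rng.choice(adj[v])        # uniform choice among neighbors (loop included)
        counts[v] += 1                    # N_ik: walkers that ended on vertex k
    return [c / K for c in counts]


edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = [[i] for i in range(6)]             # loop on every vertex
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

estimate = approx_walk_vector(adj, 0, 3, 20000, random.Random(0))
exact = [x / 432 for x in (121, 121, 130, 42, 9, 9)]   # hand-computed row of P^3
max_error = max(abs(a - b) for a, b in zip(estimate, exact))
```

With this many walks the estimate is typically within about one percentage point of the exact values on each coordinate.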

The approximate method is only interesting for very large graphs. In the following
we will consider the exact method for the complexity and the experimental evaluation. 

3.4 Generalizing the distance 

We saw that our distance is directly related to the spectral properties of the transition
matrix $P$. We show in this section how one can generalize easily and efficiently this
distance to use another weighting of the eigenvectors. To achieve this, we only need to
define different vectors $\hat{P}_{i.}$; all the rest of the approach follows.

Theorem 4 Let us consider the generalized distance $\hat{r}_{ij}^2 = \sum_{\alpha=2}^{n} f^2(\lambda_\alpha) (v_\alpha(i) - v_\alpha(j))^2$
where $f(x) = \sum_{k=0}^{\infty} c_k x^k$ is any function defined by a power series.
Then $\hat{r}_{ij} = \| D^{-1/2} \hat{P}_{i.} - D^{-1/2} \hat{P}_{j.} \|$, where $\hat{P}_{i.} = \sum_{k=0}^{\infty} c_k P^k_{i.}$, can be approximated in time
$O(rm)$ and space $O(n)$ with relative error on each coordinate less than $\delta_r = \sum_{k=r+1}^{\infty} c_k$.

Proof: We have $\hat{P}_{i.} = \sum_{k=0}^{\infty} c_k P^k_{i.} = D^{1/2} \sum_{k=0}^{\infty} \sum_{\alpha=1}^{n} c_k \lambda_\alpha^k v_\alpha(i) s_\alpha$. Therefore:

$$\left\| D^{-1/2} \hat{P}_{i.} - D^{-1/2} \hat{P}_{j.} \right\|^2 = \left\| \sum_{k=0}^{\infty} \sum_{\alpha=2}^{n} c_k \lambda_\alpha^k \left( v_\alpha(i) - v_\alpha(j) \right) s_\alpha \right\|^2$$

And we can conclude because the vectors $s_\alpha$ are orthonormal:

$$\sum_{\alpha=2}^{n} \left( \sum_{k=0}^{\infty} c_k \lambda_\alpha^k \right)^2 \left( v_\alpha(i) - v_\alpha(j) \right)^2 = \sum_{\alpha=2}^{n} f^2(\lambda_\alpha) \left( v_\alpha(i) - v_\alpha(j) \right)^2 = \hat{r}_{ij}^2$$

To compute the vectors, we approximate the series to the order $r$: $\hat{P}_{i.} \simeq \sum_{k=0}^{r} c_k P^k_{i.}$.
We only need to compute the successive powers $P^k_{i.}$ for $0 \le k \le r$, which can be done in
time $O(rm)$ and space $O(n)$. □

To illustrate this generalization, we show that it directly allows us to consider continuous
random walks. Indeed, the choice of the length of the random walks (which must be an
integer) may be restrictive in some cases. To overcome this constraint, one may consider
the continuous random walk process: during a period $dt$ the walker will go from $i$ to
$j$ with probability $P_{ij} dt$. One can prove that the probabilities to go from $i$ to $j$ after
a time $t$ are given by the matrix $e^{t(P - Id)}$. For a given period length $t$, the associated
distance is now $\hat{r}_{ij}^2 = \sum_{\alpha=2}^{n} e^{2t(\lambda_\alpha - 1)} (v_\alpha(i) - v_\alpha(j))^2$, which corresponds to a function
$f(x) = e^{t(x-1)} = \sum_{k=0}^{\infty} c_k x^k$ with $c_k = \frac{t^k e^{-t}}{k!}$.
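
The truncated series of Theorem 4 can be sketched as below. This is an illustrative example rather than the authors' code: the names `step`, `generalized_vector` and `poisson_coeffs`, the truncation order 40, the period 2.5 and the toy graph are all assumptions. The coefficients $c_k = t^k e^{-t} / k!$ are those of the continuous random walk described above.

```python
import math


def step(adj, vec):
    """One sparse product vec * P, with adj[j] listing the neighbors of j."""
    out = [0.0] * len(vec)
    for j, mass in enumerate(vec):
        if mass:
            share = mass / len(adj[j])
            for k in adj[j]:
                out[k] += share
    return out


def generalized_vector(adj, i, coeffs):
    """Approximate sum_k c_k P^k_{i.} with coeffs = [c_0, ..., c_r]."""
    vec = [0.0] * len(adj)
    vec[i] = 1.0                                  # P^0_{i.}
    acc = [coeffs[0] * x for x in vec]
    for c in coeffs[1:]:
        vec = step(adj, vec)                      # next power P^k_{i.}
        acc = [a + c * x for a, x in zip(acc, vec)]
    return acc


def poisson_coeffs(t, r):
    """c_k = t^k e^{-t} / k! for 0 <= k <= r (continuous walk of period t)."""
    return [math.exp(-t) * t ** k / math.factorial(k) for k in range(r + 1)]


edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = [[i] for i in range(6)]                     # loop on every vertex
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

# f(x) = x^3 recovers the discrete walk of length 3.
discrete = generalized_vector(adj, 0, [0.0, 0.0, 0.0, 1.0])
# Continuous walk of period 2.5, series truncated at order 40.
continuous = generalized_vector(adj, 0, poisson_coeffs(2.5, 40))
```

Choosing `coeffs` with a single 1 at position $t$ gives back $P^t_{i.}$, so the discrete distance is the special case $f(x) = x^t$.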

4 The algorithm

In the previous section, we have proposed a distance between vertices (and between sets
of vertices) to capture structural similarities between them. The problem of finding
communities is now a clustering problem. We will use here an efficient hierarchical clustering
algorithm that allows us to find community structures at different scales. We present an
agglomerative approach based on Ward's method that is well suited to our distance
and gives very good results while reducing the number of distance computations.

We start from a partition $\mathcal{P}_1 = \{\{v\}, v \in V\}$ of the graph into $n$ communities reduced
to a single vertex. We first compute the distances between all adjacent vertices. Then
this partition evolves by repeating the following operations. At each step $k$:

• choose two communities $C_1$ and $C_2$ in $\mathcal{P}_k$ according to a criterion based on the
distance between the communities that we detail later,

• merge these two communities into a new community $C_3 = C_1 \cup C_2$ and create the
new partition: $\mathcal{P}_{k+1} = (\mathcal{P}_k \setminus \{C_1, C_2\}) \cup \{C_3\}$, and

• update the distances between communities (we will see later that we actually only
do this for adjacent communities).

After $n - 1$ steps, the algorithm finishes and we obtain $\mathcal{P}_n = \{V\}$. Each step defines
a partition $\mathcal{P}_k$ of the graph into communities, which gives a hierarchical structure of
communities called dendrogram (see Figure 1(b)). This structure is a tree in which the
leaves correspond to the vertices and each internal node is associated to a merging of
communities in the algorithm: it corresponds to a community composed of the union of
the communities corresponding to its children.

The key points in this algorithm are the way we choose the communities to merge, and
the fact that the distances can be updated efficiently. We will also need to evaluate the
quality of a partition in order to choose one of the $\mathcal{P}_k$ as the result of our algorithm. We
will detail these points below, and explain how they can be managed to give an efficient
algorithm.

4.1 Choosing the communities to merge. 

This choice plays a central role for the quality of the obtained community structure. 
In order to reduce the complexity, we will only merge adjacent communities (having at 
least an edge between them). This reasonable heuristic (already used in [22 and \U)\ ) 
limits to m the number of possible mergings at each stage. Moreover it ensures that each 
community is connected. 

We choose the two communities to merge according to Ward's method. At each step
$k$, we merge the two communities that minimize the mean $\sigma_k$ of the squared distances
between each vertex and its community:

$$\sigma_k = \frac{1}{n} \sum_{C \in \mathcal{P}_k} \sum_{i \in C} r_{iC}^2$$

This approach is a greedy algorithm that tries to solve the problem of minimizing $\sigma_k$
for each $k$. This problem is known to be NP-hard: even for a given $k$, minimizing $\sigma_k$
is the NP-hard "K-Median clustering problem" for $K = (n - k)$ clusters. The
existing approximation algorithms [15] are exponential with the number of clusters to
find and unsuitable for our purpose. So for each pair of adjacent communities $\{C_1, C_2\}$,
we compute the variation $\Delta\sigma(C_1, C_2)$ of $\sigma$ that would be induced if we merge $C_1$ and $C_2$
into a new community $C_3 = C_1 \cup C_2$. This quantity only depends on the vertices of $C_1$
and $C_2$, and not on the other communities or on the step $k$ of the algorithm:

$$\Delta\sigma(C_1, C_2) = \frac{1}{n} \left( \sum_{i \in C_3} r_{iC_3}^2 - \sum_{i \in C_1} r_{iC_1}^2 - \sum_{i \in C_2} r_{iC_2}^2 \right) \qquad (2)$$

Finally, we merge the two communities that give the lowest value of $\Delta\sigma$.
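
The merging loop described so far can be sketched naively in Python. This is a didactic sketch under several assumptions, not the authors' optimized implementation: the function name `walktrap_merges` is hypothetical, the graph is assumed connected, the pair to merge is found by a plain scan instead of a balanced tree, and $\Delta\sigma$ is evaluated with the closed form $\frac{1}{n} \frac{|C_1||C_2|}{|C_1|+|C_2|} r_{C_1C_2}^2$ that Theorem 5 (Section 4.2) establishes as equal to the definition above.

```python
def walktrap_merges(edges, n, t=3):
    """Naive agglomeration; returns, in order, the pairs of member lists merged."""
    adj = [{i} for i in range(n)]                  # a loop on every vertex
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    deg = [len(a) for a in adj]

    def step(vec):                                 # one product vec * P
        out = [0.0] * n
        for j, mass in enumerate(vec):
            if mass:
                for k in adj[j]:
                    out[k] += mass / deg[j]
        return out

    vectors, members, neighbours = {}, {}, {}
    for i in range(n):                             # one community per vertex
        vec = [0.0] * n
        vec[i] = 1.0
        for _ in range(t):
            vec = step(vec)
        vectors[i], members[i], neighbours[i] = vec, [i], adj[i] - {i}

    def delta_sigma(pair):                         # closed form (Theorem 5)
        a, b = pair
        sa, sb = len(members[a]), len(members[b])
        r2 = sum((x - y) ** 2 / d
                 for x, y, d in zip(vectors[a], vectors[b], deg))
        return sa * sb / (sa + sb) * r2 / n

    merges, next_id = [], n
    while len(members) > 1:
        # Only adjacent communities are candidates for merging.
        pairs = ((a, b) for a in neighbours for b in neighbours[a] if a < b)
        a, b = min(pairs, key=delta_sigma)
        sa, sb = len(members[a]), len(members[b])
        c, next_id = next_id, next_id + 1
        # The new probability vector is the size-weighted mean of the two.
        vectors[c] = [(sa * x + sb * y) / (sa + sb)
                      for x, y in zip(vectors[a], vectors[b])]
        members[c] = members[a] + members[b]
        neighbours[c] = (neighbours[a] | neighbours[b]) - {a, b}
        for x in neighbours[c]:
            neighbours[x] -= {a, b}
            neighbours[x].add(c)
        merges.append((sorted(members[a]), sorted(members[b])))
        for old in (a, b):
            del vectors[old], members[old], neighbours[old]
    return merges


# Hypothetical example: two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
merges = walktrap_merges(edges, 6, t=3)
```

On this toy graph the last of the five merges joins the two triangles, so the partition just before it is {0, 1, 2} and {3, 4, 5}.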



4.2 Computing $\Delta\sigma$ and updating the distances.

The important point here is to notice that these quantities can be efficiently computed
thanks to the fact that our distance is a Euclidean distance, which makes it possible to
obtain the two following classical results [26]:

Theorem 5 The increase of $\sigma$ after the merging of two communities $C_1$ and $C_2$ is directly
related to the distance $r_{C_1C_2}$ by:

$$\Delta\sigma(C_1, C_2) = \frac{1}{n} \frac{|C_1||C_2|}{|C_1| + |C_2|} r_{C_1C_2}^2$$

Proof: First notice that $\sum_{i \in C_1} (P^t_{C_1.} - P^t_{i.}) = 0$ and $(|C_1| + |C_2|) P^t_{C_3.} = |C_1| P^t_{C_1.} +
|C_2| P^t_{C_2.}$. Then we consider the distance $r$ as a metric in $\mathbb{R}^n$ (that contains the probability
vectors $P^t_{C.}$) associated to an inner product $\langle . | . \rangle$. Finally, after some elementary
computations, we obtain:

$$\sum_{i \in C_1} r_{iC_3}^2 = \sum_{i \in C_1} \langle P^t_{C_3.} - P^t_{i.} \,|\, P^t_{C_3.} - P^t_{i.} \rangle = \sum_{i \in C_1} r_{iC_1}^2 + \frac{|C_1||C_2|^2}{(|C_1| + |C_2|)^2} r_{C_1C_2}^2$$

This also holds if we replace $C_1$ by $C_2$ and $C_2$ by $C_1$. Therefore:

$$\sum_{i \in C_3} r_{iC_3}^2 = \sum_{i \in C_1} r_{iC_3}^2 + \sum_{j \in C_2} r_{jC_3}^2 = \sum_{i \in C_1} r_{iC_1}^2 + \sum_{i \in C_2} r_{iC_2}^2 + \frac{|C_1||C_2|}{|C_1| + |C_2|} r_{C_1C_2}^2$$

We deduce the claim by replacing this expression into Equation (2), the definition of $\Delta\sigma$. □

This theorem shows that we only need to update the distances between communities
to get the values of $\Delta\sigma$: if we know the two vectors $P^t_{C_1.}$ and $P^t_{C_2.}$, the computation
of $\Delta\sigma(C_1, C_2)$ is possible in $O(n)$. Moreover, the next theorem shows that if we already
know the three values $\Delta\sigma(C_1, C_2)$, $\Delta\sigma(C_1, C)$ and $\Delta\sigma(C_2, C)$, then we can compute
$\Delta\sigma(C_1 \cup C_2, C)$ in constant time.

Theorem 6 (Lance-Williams-Jambu formula) If $C_1$ and $C_2$ are merged into $C_3 =
C_1 \cup C_2$ then for any other community $C$:

$$\Delta\sigma(C_3, C) = \frac{(|C_1| + |C|) \Delta\sigma(C_1, C) + (|C_2| + |C|) \Delta\sigma(C_2, C) - |C| \Delta\sigma(C_1, C_2)}{|C_1| + |C_2| + |C|} \qquad (3)$$

Proof: We replace the four $\Delta\sigma$ of Equation (3) by their values given by Theorem 5. We
multiply each side by $\frac{n(|C_1| + |C_2| + |C|)}{|C|}$ and use $|C_3| = |C_1| + |C_2|$, and obtain the equivalent
equation:

$$(|C_1| + |C_2|) r_{C_3C}^2 = |C_1| r_{C_1C}^2 + |C_2| r_{C_2C}^2 - \frac{|C_1||C_2|}{|C_1| + |C_2|} r_{C_1C_2}^2$$

Then we use the fact that $P^t_{C_3.}$ is the barycenter of $P^t_{C_1.}$ weighted by $|C_1|$ and of $P^t_{C_2.}$
weighted by $|C_2|$, therefore:

$$|C_1| r_{C_1C}^2 + |C_2| r_{C_2C}^2 = (|C_1| + |C_2|) r_{C_3C}^2 + |C_1| r_{C_1C_3}^2 + |C_2| r_{C_2C_3}^2$$

We conclude using $|C_1| r_{C_1C_3}^2 + |C_2| r_{C_2C_3}^2 = \frac{|C_1||C_2|}{|C_1| + |C_2|} r_{C_1C_2}^2$. □
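
Theorems 5 and 6 can be checked numerically on a small example. The sketch below is illustrative only: the helper names, the toy graph (two triangles joined by one edge, with a loop on each vertex), the walk length $t = 3$ and the three communities are assumptions chosen for this check.

```python
def walk_vector(adj, i, t):
    """P^t_{i.} by t sparse products; adj[j] lists neighbors of j (loop included)."""
    vec = [0.0] * len(adj)
    vec[i] = 1.0
    for _ in range(t):
        nxt = [0.0] * len(adj)
        for j, mass in enumerate(vec):
            for k in adj[j]:
                nxt[k] += mass / len(adj[j])
        vec = nxt
    return vec


def community_vector(vectors, members):
    """P^t_{C.}: mean of the members' probability vectors."""
    return [sum(vectors[i][k] for i in members) / len(members)
            for k in range(len(vectors[0]))]


def sq_dist(p, q, deg):
    """Squared distance r^2 between two probability vectors."""
    return sum((a - b) ** 2 / d for a, b, d in zip(p, q, deg))


def delta_sigma_direct(vectors, deg, c1, c2):
    """Definition: variation of the sum of squared vertex-community distances."""
    def spread(c):
        centre = community_vector(vectors, c)
        return sum(sq_dist(vectors[i], centre, deg) for i in c)
    return (spread(c1 + c2) - spread(c1) - spread(c2)) / len(deg)


def delta_sigma(vectors, deg, c1, c2):
    """Closed form of Theorem 5."""
    r2 = sq_dist(community_vector(vectors, c1),
                 community_vector(vectors, c2), deg)
    return len(c1) * len(c2) / (len(c1) + len(c2)) * r2 / len(deg)


edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = [[i] for i in range(6)]
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)
deg = [len(a) for a in adj]
vectors = [walk_vector(adj, i, 3) for i in range(6)]

c1, c2, c = [0, 1], [2], [3, 4]
direct = delta_sigma_direct(vectors, deg, c1, c2)
closed = delta_sigma(vectors, deg, c1, c2)

# Theorem 6: update for C3 = C1 U C2 from three already-known values.
updated = ((len(c1) + len(c)) * delta_sigma(vectors, deg, c1, c)
           + (len(c2) + len(c)) * delta_sigma(vectors, deg, c2, c)
           - len(c) * closed) / (len(c1) + len(c2) + len(c))
recomputed = delta_sigma(vectors, deg, c1 + c2, c)
```

Here `direct` and `closed` agree, as do `updated` and `recomputed`, up to floating-point rounding.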



Since we only merge adjacent communities, we only need to update the values of $\Delta\sigma$
between adjacent communities (there are at most $m$ values). These values are stored in
a balanced tree in which we can add, remove or get the minimum in $O(\log m)$. Each
computation of a value of $\Delta\sigma$ can be done in time $O(n)$ with Theorem 5 or in constant
time when Theorem 6 can be applied.

4.3 Evaluating the quality of a partition.

The algorithm induces a sequence $(\mathcal{P}_k)_{1 \le k \le n}$ of partitions into communities. We now
want to know which partitions in this sequence capture well the community structure.
The most widely used criterion is the modularity $Q$ introduced in [32, 33], which relies
on the fraction of edges $e_C$ inside community $C$ and the fraction of edges¹ $a_C$ bound to
community $C$:

$$Q(\mathcal{P}) = \sum_{C \in \mathcal{P}} e_C - a_C^2$$

The best partition is then considered to be the one that maximizes $Q$.
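
As an illustration, the modularity of a partition can be computed directly from an edge list. In the sketch below the function name `modularity` is hypothetical and the edge list is assumed to contain exactly the edges one wishes to count (the loops added on each vertex are left out here, which is a choice made for this example). An edge between two communities contributes 1/2 to the $a_C$ of each.

```python
def modularity(edges, partition):
    """Q = sum over communities of e_C - a_C^2, from fractions of edges."""
    m = len(edges)
    community_of = {v: c for c, members in enumerate(partition) for v in members}
    inside = [0.0] * len(partition)       # edges with both ends in C
    bound = [0.0] * len(partition)        # edges bound to C
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        if cu == cv:
            inside[cu] += 1
            bound[cu] += 1
        else:                             # inter-community edge: 1/2 to each side
            bound[cu] += 0.5
            bound[cv] += 0.5
    return sum(inside[c] / m - (bound[c] / m) ** 2
               for c in range(len(partition)))


# Hypothetical example: two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
q_split = modularity(edges, [[0, 1, 2], [3, 4, 5]])
q_whole = modularity(edges, [[0, 1, 2, 3, 4, 5]])
```

For the split into the two triangles, $e_C = 3/7$ and $a_C = 1/2$ for each community, giving $Q = 5/14$, while the trivial partition into a single community gives $Q = 0$.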

However, depending on one's objectives, one may consider other quality criteria for
a partition into communities. For instance, the modularity is not well suited to find
communities at different scales. Here we provide another criterion that helps in finding
such structures. When we merge two very different communities (with respect to the
distance $r$), the value $\Delta\sigma_k = \sigma_{k+1} - \sigma_k$ at this step is large. Conversely, if $\Delta\sigma_k$ is large
then the communities at step $k - 1$ are surely relevant. To detect this, we introduce the
increase ratio $\eta_k$:

$$\eta_k = \frac{\Delta\sigma_k}{\Delta\sigma_{k-1}} = \frac{\sigma_{k+1} - \sigma_k}{\sigma_k - \sigma_{k-1}}$$

One may then consider that the relevant partitions $\mathcal{P}_k$ are those associated with the
largest values of $\eta_k$. Depending on the context in which our algorithm is used, one may
take only the best partition (the one for which $\eta_k$ is maximal) or choose among the best
ones using another criterion (like the size of the communities, for instance). This is an
important advantage of our method, which helps in finding the different scales in the
community structure. However we used the modularity (which produces better results to
find a unique partition and is not specific to our algorithm) in our experimental tests to
be able to compare our algorithm with the previously proposed ones.

4.4 Complexity. 

First, the initialization of the probability vectors is done in $O(mnt)$. Then, at each step
$k$ of the algorithm, we keep in memory the vectors $P^t_{C.}$ corresponding to the current
communities (the ones in the current partition). But for the communities that are not
in $\mathcal{P}_k$ (because they have been merged with another community before) we only keep the
information saying in which community they have been merged. We keep enough information
to construct the dendrogram and have access to the composition of any community with
a few more computations.

When we merge two communities $C_1$ and $C_2$ we perform the following operations:

• Compute $P^t_{(C_1 \cup C_2).} = \frac{|C_1| P^t_{C_1.} + |C_2| P^t_{C_2.}}{|C_1| + |C_2|}$ and remove $P^t_{C_1.}$ and $P^t_{C_2.}$.

• Update the values of $\Delta\sigma$ concerning $C_1$ and $C_2$ using Theorem 6 if possible, or
otherwise using Theorem 5.

The first operation can be done in $O(n)$, and therefore does not play a significant role in
the overall complexity of the algorithm. The dominating factor in the complexity of the
algorithm is the number of distances $r$ computed (each one in $O(n)$). We prove an upper
bound of this number that depends on the height of the dendrogram. We denote by $h(C)$
the height of a community $C$ and by $H$ the height of the whole tree ($H = h(V)$).

Theorem 7 An upper bound of the number of distances computed by our algorithm is
$2mH$. Therefore its global time complexity is $O(mn(H + t))$.

¹ Inter-community edges contribute 1/2 to each community.

Figure 1: (a) An example of community structure found by our algorithm using random
walks of length $t = 3$. (b) The stages of the algorithm encoded as a tree called dendrogram.
The maxima of $\eta_k$ and $Q$, plotted in (c), show that the best partition consists of two
communities. The maximal values of $\eta_k$ also show that communities of different scales may
be relevant.

Proof: Let $M$ be the number of computations of $\Delta\sigma$. $M$ is equal to $m$ (initialization
of the first $\Delta\sigma$) plus the sum over all steps $k$ of the number of neighbors of the new
community created at step $k$ (when we merge two communities, we need to update one
value of $\Delta\sigma$ per neighbor). For each height $1 \le h \le H$, the communities with the same
height $h$ are pairwise disjoint, and the sum of their numbers of neighbor communities is
less than $2m$ (each edge can at most define two neighborhood relations). The sum over
all heights finally gives $M \le 2Hm$. Each of these $M$ computations needs at most one
computation of $r$ in time $O(n)$ (Theorem 5). Therefore, with the initialization, the global
complexity is $O(mn(H + t))$. □

In practice, a small t must be chosen (we must have t = O(log n) due to the exponential
convergence speed of the random walk process) and thus the global complexity is O(mnH).
We always empirically observed that the best results are obtained using a length $3 \le t \le 8$.
We moreover observed that the choice of t in this range is not crucial as the results are
often similar. Hence we think that a good empirical compromise is to choose t = 4 or
t = 5. We also advise reducing this length for very dense graphs and increasing it for
very sparse ones because the convergence speed of the random walk process increases with
the graph density. Studying more formally the influence of t, and determining optimal
values, remains to be done.
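The remark on graph density can be illustrated on two toy graphs. The sketch below is not taken from the paper: it assumes the standard random walk in which the walker moves to a uniformly chosen neighbor at each step, and it relies on the classical fact that, on a connected non-bipartite undirected graph, this walk converges to the distribution d(v)/2m. The function names and the two example graphs (a complete graph and a cycle on 9 vertices) are illustrative choices.

```python
def walk_distribution(adj, start, t):
    """Distribution of a t-step random walk started at `start`.

    `adj` maps each vertex to the list of its neighbours; at each step the
    walker moves to a neighbour chosen uniformly at random.
    """
    prob = {v: 0.0 for v in adj}
    prob[start] = 1.0
    for _ in range(t):
        nxt = {v: 0.0 for v in adj}
        for v, p in prob.items():
            if p:
                share = p / len(adj[v])
                for w in adj[v]:
                    nxt[w] += share
        prob = nxt
    return prob


def distance_to_stationary(adj, start, t):
    """L1 distance between the t-step distribution and d(v) / 2m."""
    total = sum(len(nbrs) for nbrs in adj.values())  # sum of degrees, i.e. 2m
    prob = walk_distribution(adj, start, t)
    return sum(abs(prob[v] - len(adj[v]) / total) for v in adj)


n = 9
complete = {v: [w for w in range(n) if w != v] for v in range(n)}  # dense graph
cycle = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}          # sparse graph

for t in (1, 3, 5, 8):
    print(t, distance_to_stationary(complete, 0, t), distance_to_stationary(cycle, 0, t))
```

On these two graphs the distance to the stationary distribution shrinks much faster on the dense graph, which is consistent with the advice to use shorter walks on dense graphs.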

The worst case is H = n - 1, which occurs when the vertices are merged one by one
into a large community. This happens in the "star" graph, where a central vertex is linked
to the n - 1 others. However, Ward's algorithm is known to produce small communities
of similar sizes. This tends to get closer to the favorable case in which the community
structure is a balanced tree and its height is H = O(log n).
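To make the two extreme cases concrete, the following minimal sketch computes the height H of a dendrogram from a list of merges, together with the bound 2mH of Theorem 7. The merge-list encoding and all function names are assumptions made for this illustration, not part of the Walktrap implementation; single vertices are given height 0.

```python
def dendrogram_height(n, merges):
    """Height H of the dendrogram built by `merges`.

    Vertices are communities 0..n-1 with height 0; the k-th merge (a, b)
    creates community n + k with height 1 + max(height(a), height(b)).
    """
    height = [0] * n
    for a, b in merges:
        height.append(1 + max(height[a], height[b]))
    return max(height)


def chain_merges(n):
    """Worst case: vertices are absorbed one by one into a growing community."""
    merges, current = [], 0
    for k, v in enumerate(range(1, n)):
        merges.append((current, v))
        current = n + k  # id of the community just created
    return merges


def balanced_merges(n):
    """Favorable case: communities are merged pairwise, level by level."""
    merges, level, next_id = [], list(range(n)), n
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            merges.append((level[i], level[i + 1]))
            nxt.append(next_id)
            next_id += 1
        if len(level) % 2:  # an odd community out is carried to the next level
            nxt.append(level[-1])
        level = nxt
    return merges


def distance_bound(m, H):
    """Upper bound 2mH on the number of distance computations (Theorem 7)."""
    return 2 * m * H


n = 1024
print(dendrogram_height(n, chain_merges(n)))     # 1023, i.e. n - 1
print(dendrogram_height(n, balanced_merges(n)))  # 10, i.e. log2(n)
```

For a fixed number of edges m, the bound 2mH therefore differs by a factor of about n / log n between the two cases.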

However, this upper bound is not reached in practical cases. We evaluated the actual
number of distance computations done on graphs from the test set presented in Section 5.1.
We chose graphs with n = 3 000 vertices; their mean number of edges is m = 47 000 and
the mean height of the computed dendrograms is H = 31.6. We compared the worst case
upper bound 2m(n - 1) and the upper bound 2mH with the actual number of distances
computed with and without using the theorem that updates $\Delta\sigma$ after a merge.

                   Method                        Number of distances computed
Upper bounds       2m(n - 1)                     282 000 000
                   2mH                           2 970 000
Practical tests    without the update theorem    321 000
                   with the update theorem       277 000
                   with additional heuristics    103 000

Table 1: Number of distances computed according to upper bounds and practical tests.

We also considered an additional heuristic that consists in applying the same theorem
whenever we only know one of the two quantities $\Delta\sigma(C_1, C)$ or $\Delta\sigma(C_2, C)$. In this case we
assume that the other one is greater than the current minimal $\Delta\sigma$ and we obtain a lower
bound for $\Delta\sigma(C_1 \cup C_2, C)$. Later, if this lower bound becomes the minimal $\Delta\sigma$, then we
compute the exact distance in O(n). Otherwise, if the community $C_3 = C_1 \cup C_2$ is merged
using another community than C, the exact computation is avoided. This heuristic can
induce an inexact merging order when the other, unknown $\Delta\sigma$ is not greater than the
current minimal $\Delta\sigma$; we observed in this test that this happened in 0.05% of the cases.

The results, reported in Table 1, show that in practical cases the actual complexity
of our approach is significantly lower than the upper bound we proved. However, this
upper bound can be reached in the pathological case of the star graph.
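As a quick check of the orders of magnitude, the two upper bounds can be recomputed from the averages reported for the test graphs (n = 3 000, m = 47 000, H = 31.6). Since m and H are means over the test set, the values are only indicative.

```latex
\begin{align*}
2m(n-1) &= 2 \times 47\,000 \times 2\,999 = 281\,906\,000 \approx 2.82 \times 10^{8},\\
2mH     &= 2 \times 47\,000 \times 31.6 = 2\,970\,400 \approx 2.97 \times 10^{6}.
\end{align*}
```

With these figures, the 277 000 distances measured in the practical test amount to roughly 9% of the bound 2mH and about 0.1% of the worst case bound.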

5 Experimental evaluation of the algorithm 

In this section we evaluate the performance of our algorithm and compare it with
most previously proposed methods. This comparison has been done on both randomly
generated graphs with communities and real world networks. In order to obtain rigorous
and precise results, all the programs have been extensively tested on the same large set
of graphs.

The test compares the following community detection programs: 

• this paper (Walktrap) with random walk length t = 5 and t = 2, 

• the Girvan Newman algorithm [23, 33] (a divisive algorithm that removes the edges
with the largest betweenness),

• the Fast Modularity algorithm proposed by Newman and later improved (a greedy
algorithm designed for very large graphs that optimizes the modularity),

• the approach of Donetti and Munoz using the Laplacian matrix [10] and its new
improved version [11] (a spectral approach with a hierarchical algorithm),

• the Netwalk algorithm 0H1 (another algorithm based on random walks), 

• the Markov Cluster Algorithm (MCL) |^ (an algorithm based on simulation of 
(stochastic) flow in graphs), 

• and the Cosmoweb algorithm [SI (a gravitational approach designed for web cluster- 
ing). 

We refer to Section IL^ and to the cited references for more details on these algorithms. 
5.1 Comparison on generated graphs




Figure 2: Quality and time performance of different approaches as a function of the size of the
graphs (N). Left: mean quality of the partition found (R'). Right: mean execution time (in
seconds).



Evaluating a community detection algorithm is a difficult task because one needs 
some test graphs whose community structure is already known. A classical approach is 
to use randomly generated graphs with communities. Here we will use this approach and 
generate the graphs as follows. 

The parameters we consider are:

• the number k of communities and their sizes $|C_i|$ (these parameters give the number
of vertices N),

• the internal degree $d_{in}(C_i)$ of each community,

• and the wanted modularity Q.

In order to reduce the number of parameters, we consider that the external degrees are
proportional to the internal degrees: $\forall i,\ d_{out}(C_i) = \beta \times d_{in}(C_i)$. One can check that the
expected modularity is then:

$$Q = \frac{1}{1 + \beta} - \frac{\sum_i \left( d_{in}(C_i) \times |C_i| \right)^2}{\left( \sum_i d_{in}(C_i) \times |C_i| \right)^2}$$

We therefore obtain the wanted modularity by choosing the appropriate value for $\beta$.

Once these parameters have been chosen, we draw each internal edge of a given community
with the same probability, producing Erdős-Rényi-like communities. Then the
external degrees are chosen proportionally to the internal degrees (with a factor $\beta$) and
the vertices are randomly linked with respect to some constraints (no loop, no multiple
edge).
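The choice of the factor between external and internal degrees can be sketched as follows. The code inverts the expected modularity Q = 1/(1 + beta) - sum_i (d_in(C_i) |C_i|)^2 / (sum_i d_in(C_i) |C_i|)^2 to obtain beta from a wanted modularity, and draws the internal edges. The function names and the internal edge probability d_in / (|C_i| - 1) are assumptions of this illustration; the wiring of external edges is omitted because the text only describes its constraints.

```python
import random


def beta_for_modularity(sizes, d_in, q):
    """Ratio beta of external to internal degree giving expected modularity q."""
    w = [d * s for d, s in zip(d_in, sizes)]
    penalty = sum(x * x for x in w) / sum(w) ** 2
    return 1.0 / (q + penalty) - 1.0


def expected_modularity(sizes, d_in, beta):
    """Expected modularity for the given community sizes, degrees and beta."""
    w = [d * s for d, s in zip(d_in, sizes)]
    return 1.0 / (1.0 + beta) - sum(x * x for x in w) / sum(w) ** 2


def internal_edges(sizes, d_in, rng):
    """Draw each internal pair of a community with the same probability.

    Assumes every community has at least two vertices; vertices are numbered
    consecutively, community by community.
    """
    edges, start = [], 0
    for size, d in zip(sizes, d_in):
        p = min(1.0, d / (size - 1))  # gives an expected internal degree d
        for u in range(start, start + size):
            for v in range(u + 1, start + size):
                if rng.random() < p:
                    edges.append((u, v))
        start += size
    return edges


sizes, d_in = [50] * 4, [8.0] * 4
beta = beta_for_modularity(sizes, d_in, q=0.5)
print(round(beta, 4))  # 0.3333
edges = internal_edges(sizes, d_in, random.Random(0))
```

With four equal communities the second term of the formula is 1/4, so a wanted modularity of 0.5 requires 1/(1 + beta) = 0.75.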
To evaluate the quality of the partitions found by the algorithms, we compare them
to the original generated partition. To achieve this, we use the Rand index corrected by
Hubert and Arabie [37, 25], which evaluates the similarities between two partitions. The
Rand index $R(\mathcal{P}_1, \mathcal{P}_2)$ is the ratio of pairs of vertices correlated by the partitions $\mathcal{P}_1$ and
$\mathcal{P}_2$ (two vertices are correlated by the partitions $\mathcal{P}_1$ and $\mathcal{P}_2$ if they are classified in the
same community or in different communities in the two partitions). The expected value
of R for a random partition is not zero. To avoid this, Hubert and Arabie proposed a
corrected index that is also more sensitive: $R' = \frac{R - R_{exp}}{R_{max} - R_{exp}}$, where $R_{exp}$ is the expected
value of R for two random partitions with the same community sizes as $\mathcal{P}_1$ and $\mathcal{P}_2$, and
$R_{max}$ is its maximum value. This quantity can be efficiently computed using the following
equivalent formula:

$$R'(\mathcal{P}_1, \mathcal{P}_2) = \frac{N^2 \sum_{i,j} |C_i^1 \cap C_j^2|^2 - \sum_i |C_i^1|^2 \sum_j |C_j^2|^2}{\frac{1}{2} N^2 \left( \sum_i |C_i^1|^2 + \sum_j |C_j^2|^2 \right) - \sum_i |C_i^1|^2 \sum_j |C_j^2|^2}$$

where $(C_i^x)_{1 \le i \le k_x}$ are the communities of the partition $\mathcal{P}_x$ and N is the total number of
vertices.
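The closed formula lends itself to a direct implementation. The sketch below assumes that each partition is given as a sequence of community labels indexed by vertex; the function name and this encoding are illustrative. It evaluates R' as (N^2 S12 - S1 S2) / (N^2 (S1 + S2) / 2 - S1 S2), where S1 and S2 are the sums of squared community sizes of the two partitions and S12 is the sum of squared sizes of the pairwise intersections.

```python
from collections import Counter


def corrected_rand_index(labels1, labels2):
    """Corrected Rand index R' of two partitions given as label sequences."""
    n = len(labels1)
    if n != len(labels2):
        raise ValueError("both partitions must cover the same vertices")
    s1 = sum(c * c for c in Counter(labels1).values())    # sum_i |C_i^1|^2
    s2 = sum(c * c for c in Counter(labels2).values())    # sum_j |C_j^2|^2
    s12 = sum(c * c for c in Counter(zip(labels1, labels2)).values())  # intersections
    numerator = n * n * s12 - s1 * s2
    denominator = 0.5 * n * n * (s1 + s2) - s1 * s2
    if denominator == 0:
        raise ValueError("R' is undefined when both partitions have a single community")
    return numerator / denominator


print(corrected_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0 (same partition, relabeled)
print(corrected_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0 (independent partitions)
```

Only community sizes and intersection sizes are needed, so the computation is linear in the number of vertices once the labels are available.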

This quantity has many advantages compared to the "ratio of vertices correctly identified"
that has been widely used in the past. It captures the similarities between partitions
even if they do not have the same number of communities, which is crucial here as we
will see below. Moreover, a random partition always gives the same expected value, which
does not depend on the number of communities.

We also compared the partitions using the modularity. However, the results and the 
conclusions were very similar to those obtained with R'. In order to reduce the size of this
section and to avoid duplicated information, we only plotted the results obtained with
the corrected Rand index R'.



Homogeneous graphs Let us start with the simplest case where all the communities
are similar (same size and same density). Therefore we only have to choose
the size N of the graphs, the number k of communities, the internal degree $d_{in}$ of
communities and the wanted modularity Q. The internal edges are drawn with the same
probability, producing a Poisson degree distribution. We generated graphs corresponding
to combinations of the following parameters:

• sizes N in {100, 300, 1 000, 3 000, 10 000, 30 000, 100 000},

• number of communities, $k = N^{\gamma}$ with $\gamma$ in {0.3, 0.42, 0.5},

• internal degree, $d_{in}(C_i) = \alpha \ln(|C_i|)$ with $\alpha$ in {2, 4, 6, 8, 10},

• wanted modularity Q in {0.2, 0.3, 0.4, 0.5, 0.6}.
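The parameter grid above can be enumerated as follows. This is only a sketch: the rounding of k to the nearest integer and the use of equal, possibly fractional, community sizes N / k are assumptions, since the text does not say how these quantities are rounded.

```python
import math
from itertools import product

SIZES = [100, 300, 1_000, 3_000, 10_000, 30_000, 100_000]
GAMMAS = [0.3, 0.42, 0.5]
ALPHAS = [2, 4, 6, 8, 10]
MODULARITIES = [0.2, 0.3, 0.4, 0.5, 0.6]


def homogeneous_settings():
    """Yield one dict per combination of the four parameters."""
    for n, gamma, alpha, q in product(SIZES, GAMMAS, ALPHAS, MODULARITIES):
        k = max(1, round(n ** gamma))   # number of communities (rounding is assumed)
        size = n / k                    # equal-sized communities (assumed)
        d_in = alpha * math.log(size)   # internal degree alpha * ln(|C_i|)
        yield {"N": n, "k": k, "size": size, "d_in": d_in, "Q": q}


settings = list(homogeneous_settings())
print(len(settings))  # 525
```

Each graph size is thus covered by 75 parameter combinations, over which the means are computed.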

The first comparison of the quality and time performances is plotted in Figure 2. For
each graph size, we plotted the mean corrected Rand index (R') and the mean running
time. To avoid favoring (or penalizing) some approaches through particular
parameters, the mean has been computed over all the possible combinations of the
parameters listed above. This first comparison shows that our algorithm has the advantage
of being efficient regarding both the quality of the results and the speed, while other
algorithms only achieve one of these goals. It can handle very large graphs with up to
300 000 vertices (this limitation is due to its memory requirements). Larger graphs can
be processed (without the same quality of results) with the Fast Modularity algorithm,
which has been able to process a 2 million vertex graph.

We also plotted R' in Figure 3 to observe the influence of the modularity of the generated
partition on the results. These first tests show that most previously proposed
approaches have good performances on small graphs. But our approach is the only one
that can process large graphs while producing good results. Notice that the improved
approach of Donetti and Munoz also produces very good results but requires more
computational time. This improved version [11] uses exactly the same eigenvectors as the
ones we use in our algorithm, which explains why the quality of the results is similar.
The MCL algorithm was difficult to use in this intensive test since the user must choose
a granularity parameter for each input graph, which is a limitation of this algorithm. We
manually chose one parameter for each size of graph (hence the results are not optimal,
which may explain their fluctuations), doing our best to find a good one.

Figure 3: Quality of the partition found as a function of the modularity of the generated
partition for different sizes N (same legend as Figure 2).

It is also interesting to compare the distribution of the size of the communities found to 
the size of the generated communities. We plotted these quantities in Figure 4 for graphs
with N = 3000 vertices. We generated graphs with three different sizes of communities
and the results can explain the limitations of some approaches. It seems for instance that 
the Fast Modularity algorithm |H] produces communities that always have the same size 
independently of the actual size of the communities. Likewise, Cosmoweb [0] produces
too many very small communities (1 to 4 vertices). 

Heterogeneous graphs The second set of graphs has different kinds of communities
(different sizes and different densities). The sizes of the communities are randomly chosen
according to a power law and the internal density of each community is also randomly
chosen. We therefore have the two following additional parameters:

• the range of internal degree: $d_{in}(C_i)$ is uniformly chosen between $\alpha_{min} \ln(|C_i|)$ and
$\alpha_{max} \ln(|C_i|)$ with $(\alpha_{min}, \alpha_{max})$ = (5,7), (4,8) and (3,9), and

• the community size distribution is a power law of exponent a in 2.1, 2.5 and 3.^ 

Footnote: The community sizes are chosen within a range $[S_{min}..S_{max}]$ and the probability that a community has
Figure 4: Distribution of the size of the communities for three different numbers of generated
communities corresponding to 11, 30 or 55 communities on N = 3000 vertex graphs.



To study the influence of the heterogeneity of the communities, we generated graphs of
size N = 3000 with all combinations of the previous parameters (modularity, number
of communities) and of the two new ones. The three values of the above parameters
correspond to three levels of heterogeneity. Figure 5 shows that our approach is not
influenced by the heterogeneity of the communities, whereas the others are.
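Drawing community sizes from a truncated power law can be sketched as follows. The sketch assumes that the probability of a size S in a range [s_min, s_max] is proportional to (S + shift)^(-exponent), in the spirit of the footnote on community sizes; the function names, the parameter names and the numerical range used in the example are hypothetical. The tuning of the shift so that the expected graph size matches a given N is not shown.

```python
import random


def sample_community_sizes(k, s_min, s_max, exponent, shift=0.0, seed=None):
    """Draw k sizes S in [s_min, s_max] with P(S) proportional to (S + shift)**(-exponent)."""
    rng = random.Random(seed)
    sizes = list(range(s_min, s_max + 1))
    weights = [(s + shift) ** (-exponent) for s in sizes]
    return rng.choices(sizes, weights=weights, k=k)


def expected_size(s_min, s_max, exponent, shift=0.0):
    """Mean community size under the same truncated power law."""
    sizes = range(s_min, s_max + 1)
    weights = [(s + shift) ** (-exponent) for s in sizes]
    return sum(s * w for s, w in zip(sizes, weights)) / sum(weights)


sizes = sample_community_sizes(k=30, s_min=10, s_max=500, exponent=2.5, seed=1)
print(len(sizes), min(sizes), max(sizes))
print(expected_size(10, 500, 2.1), expected_size(10, 500, 3.0))
```

A smaller exponent gives a heavier tail and therefore a larger mean size, while a larger shift flattens the distribution; a shift matching a target graph size could be found by bisection on `expected_size`.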

5.2 Comparison on real world networks 

To extend the comparison between algorithms, we also conducted experiments on some
real world networks. However, judging the quality of the different partitions found is very
difficult because we do not have a reference partition that can be considered as the actual
communities of the network. We therefore only compared the value of the modularity found
by the different algorithms. The results are reported in Table 2.
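For reference, the modularity of a partition can be computed as in the following sketch. It uses the usual form Q = sum over communities C of (e_C - a_C^2), where e_C is the fraction of edges inside C and a_C the fraction of edge ends attached to C; the edge-list and label-dictionary encodings, as well as the function name, are assumptions of this illustration.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q = sum_C (e_C - a_C**2) of a partition.

    `edges` is a list of undirected edges (u, v) without duplicates and
    `community` maps each vertex to its community label.
    """
    m = len(edges)
    inside = defaultdict(float)  # e_C: fraction of edges inside C
    bound = defaultdict(float)   # a_C: fraction of edge ends attached to C
    for u, v in edges:
        cu, cv = community[u], community[v]
        if cu == cv:
            inside[cu] += 1.0 / m
        bound[cu] += 0.5 / m     # each end of an edge counts for 1/2
        bound[cv] += 0.5 / m
    return sum(inside[c] - bound[c] ** 2 for c in bound)


# Two triangles joined by a single edge, split into their natural communities.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
labels = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(round(modularity(edges, labels), 4))  # 0.3571
```

In this example each community contains 3 of the 7 edges and half of the edge ends, which gives Q = 2 × (3/7 - 1/4).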
We used the following real world networks : 

• The Zachary's karate club network 07] , a small social network that has been widely 
used to test most of the community detection algorithms. 

• The college football network from 

• The protein interaction network studied in [27].

• A scientific collaboration network computed on the arXiv database [50].

• An internet map provided by Damien Magoni [24].

• The web graph studied in [2].

size S is actually proportional to $(S + \mu)^{-a}$, with $\mu$ chosen such that the expected size of the overall graph is
equal to a given N.
Figure 5: Influence of the heterogeneity of the graphs (for four sizes of graphs N =
100, 300, 1 000, 3 000). On the x axis, left corresponds to homogeneous graphs and right
corresponds to very heterogeneous graphs. The quality of the partition is plotted as a function
of different parameters as described in the text. Top: internal density given by the range
$(\alpha_{min}, \alpha_{max})$. Bottom: community sizes given by the exponent of the power law distribution.



graph                       karate     foot       protein     arxiv        internet     www
nb vertices/mean degree     33/4.55    115/10.7   594/3.64    9377/5.14    67882/8.12   159683/11.6
Walktrap (t = 5)            0.38/0s    0.60/0s    0.67/0.02s  0.76/4.61s   0.76/1030s   0.91/5770s
Walktrap (t = 2)            0.38/0s    0.60/0s    0.64/0.01s  0.71/1.08s   0.69/273s    0.84/468s
Fast Modularity             0.39/0s    0.57/0s    0.71/0s     0.77/1.65s   0.72/483s    0.92/1410s
Donetti Munoz               0.41/0s    0.60/0s    0.59/0.34s  0.66/1460s
Donetti Munoz (Laplacian)   0.41/0s    0.60/0s    0.60/1.37s  0.62/1780s
Cosmoweb                    -0.05/0s   0.33/0s    0.50/0.02s  0.60/0.65s   0.47/6.82s   0.79/21s
Girvan Newman               0.40/0s    0.60/0.39s 0.70/6.93s  >40000s
Netwalk                     0.40/0.02s 0.60/0.07s 0.60/5.2s   >40000s
Duch Arenas                 0.41/0s    0.60/0.05s 0.69/1.9s   0.77/14000s
MCL                         0.36/0s    0.60/0.05s 0.66/0.58s  0.73/61.3s


Table 2: Performances on real world networks (modularity / time (in seconds)). The second 
line shows the size of the graphs given by their number of vertices and their mean degree. 



We reduced the sizes of these networks by only keeping the largest connected component 
and by iteratively removing all the one-degree vertices (which do not provide significant 
information on community structures). This allowed us to run the comparison tests with 
all the algorithms on smaller networks (Table 2 reports the size and the mean degree of
the graphs after this processing). 
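The reduction described above can be sketched as follows, assuming the graph is stored as a dictionary mapping each vertex to the collection of its neighbors (a symmetric adjacency structure). The function names are illustrative and this is not the preprocessing code used for the experiments.

```python
from collections import deque


def largest_component(adj):
    """Vertices of the largest connected component of an undirected graph."""
    seen, best = set(), set()
    for source in adj:
        if source in seen:
            continue
        component, queue = {source}, deque([source])
        while queue:  # breadth-first search from `source`
            u = queue.popleft()
            for v in adj[u]:
                if v not in component:
                    component.add(v)
                    queue.append(v)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def reduce_network(adj):
    """Keep the largest component, then iteratively drop degree-one vertices."""
    keep = largest_component(adj)
    graph = {u: {v for v in adj[u] if v in keep} for u in keep}
    leaves = deque(u for u, nbrs in graph.items() if len(nbrs) == 1)
    while leaves:
        u = leaves.popleft()
        if u not in graph or len(graph[u]) != 1:
            continue  # already removed, or its degree changed meanwhile
        (v,) = graph.pop(u)
        graph[v].discard(u)
        if len(graph[v]) == 1:  # v may have become a degree-one vertex
            leaves.append(v)
    return graph


# A triangle with a two-vertex tail, plus a separate edge.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3], 5: [6], 6: [5]}
print(reduce_network(adj))
```

On this example the separate edge is discarded with the smaller component and the tail is removed vertex by vertex, leaving only the triangle.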
6 Conclusion and further work



We proposed a new distance between vertices that quantifies their structural similarity
using random walks. This distance has several advantages: it captures much information
on the community structure, and it can be used in an efficient hierarchical agglomerative
algorithm that detects communities in a network. We designed such an algorithm, which
works in the worst case in time $O(mn^2)$. In practice, real-world complex networks are
sparse (m = O(n)) and the height of the dendrogram is small (H = O(log n)); in this
case the algorithm runs in $O(n^2 \log n)$. An implementation is provided at [49].

Extensive experiments show that our method provides good results in various conditions
(graph sizes, densities, and number of communities). We used such experiments to
compare our algorithm to the main previously proposed ones. This direct comparison
shows that our approach has a clear advantage in terms of quality of the computed
partition and presents the best tradeoff between quality and running time for large networks.
It however has the limitation of needing quite a large amount of memory, which makes
the Fast Modularity approach a relevant challenger of our method for very large graphs
(millions of vertices).

Our method could be integrated in a multi-scale visualization tool for large networks,
and it may be relevant for the computation of overlapping communities (which often
occur in real-world cases and on which very little has been done until now [34]). We
consider these two points as promising directions for further work. Finally, we pointed
out that the method is directly usable for weighted networks. For directed ones (like the
important case of the web graph), on the contrary, the proofs we provided are not valid
anymore, and random walks behave significantly differently. Therefore, we also consider
the directed case as an interesting direction for further research.